Teaching Computers to See Patterns in Scatterplots with Scagnostics

Abstract:

An abstract of less than 150 words.

Harriet Mason https://www.britannica.com/animal/quokka (Monash University) , Stuart Lee https://stuartlee.org (Genentech) , Ursula Laa https://uschilaa.github.io (University of Natural Resources and Life Sciences) , Dianne Cook https://dicook.org (Monash University)

Introduction

Visualising high-dimensional data is often difficult and requires a trade-off between the usefulness of the plots and maintaining the structures of the original data. Scagnostics (scatterplot diagnostics) are a set of visual features that can be used to identify interesting and abnormal scatterplots, and thus give a sense of priority to the variables we choose to visualise. This paper discusses the creation of an R package that provides a user-friendly method to calculate these scagnostics. The package is tested on datasets with known interesting visual features to ensure the scagnostics are working as expected, before finally being used to explore and describe a time series dataset.

As the number of dimensions in a dataset increases, the process of visualising its structure and variable dependencies becomes more tedious. This is because the number of possible pairwise plots grows quadratically with the number of dimensions: p variables give p(p-1)/2 distinct pairs. Datasets like Anscombe’s quartet (Anscombe 1973) or the datasaurus dozen (Locke and D’Agostino McGowan 2018) have been constructed such that each pairwise plot has the same summary statistics but strikingly different visual features. This design illustrates the pitfalls of numerical summaries and the importance of visualisation. Despite the issues that come with increasing dimensionality, visualisation of the data therefore cannot be ignored. Scagnostics offer one possible solution to this issue.
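The claim about Anscombe’s quartet can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper names (`mean`, `pearson`, `QUARTET`) are illustrative and not part of any package discussed here. It computes the mean of x, the mean of y and the Pearson correlation for each of the four datasets.

```python
import math

# Anscombe's quartet (Anscombe 1973): datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Summary statistics rounded to two decimals: (mean x, mean y, correlation).
summaries = {
    name: (round(mean(x), 2), round(mean(y), 2), round(pearson(x, y), 2))
    for name, (x, y) in QUARTET.items()
}
for name, stats in summaries.items():
    print(name, stats)
```

To two decimal places all four datasets give the same triple, even though their scatterplots look nothing alike, which is the motivation for measures that describe the shape of the point cloud itself.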

The term scagnostics was introduced by John Tukey in 1982 (Tukey 1988). Tukey discusses the value of a cognostic (a diagnostic that should be interpreted by a computer rather than a human) to filter out uninteresting visualisations. He calls a cognostic that is specific to scatter plots a scagnostic. Up to a moderate number of variables, a scatter plot matrix (SPLOM) can be used to create pairwise visualisations; however, this solution quickly becomes infeasible as the number of variables grows. Thus, instead of trying to view every possible variable combination, the workload is reduced by calculating a series of visual features, and only presenting the scatter plots that are outliers on these feature combinations.

There is a large amount of research into visualising high-dimensional data, most of which focuses on some form of dimension reduction. This can be done by creating a hierarchy of potential variables, performing a transformation of the variables, or some combination of the two. Unfortunately, none of these methods is without pitfalls. Linear transformations are subject to crowding, where low-dimensional projections concentrate data in the centre of the distribution, making it difficult to differentiate data points (Diaconis and Freedman 1984). Non-linear transformations often have complex parameterisations, and can break the underlying global structure of the data, creating misleading visualisations. While there are solutions within these methods to fix these issues, such as a burning sage tour for crowding (U. Laa, Cook, and Lee 2020) or the liminal package for maintaining global structure (Lee, Laa, and Cook 2020), all these methods still involve some transformation of the data. Scagnostics give the benefit of allowing the user to view relationships between the variables in their raw form. This means they are not subject to the linear transformation issue of crowding, or the non-linear transformation issue of misleading global structures. That being said, only viewing pairwise plots can leave our variable interpretations without context. Methods such as those shown in ScagExplorer (Dang and Wilkinson 2014) try to address this by visualising the pairwise plots in relation to the distribution of the scagnostic measures, but ultimately the lack of context remains one of the limitations of using scagnostics alone as a dimension reduction technique.

Scagnostics are not only useful in isolation; they can also be applied in conjunction with other techniques to find interesting feature combinations of the transformed variables. The tourr projection pursuit currently uses a selection of scagnostics to identify interesting low-dimensional projections and move the visualisation towards them (U. Laa and Cook 2020). Since scagnostics are not dependent on the type of data, they can also be used to compare and contrast scatter plots regardless of the discipline. In this way, they are a useful metric for something like the comparisons described in A self-organizing, living library of time-series data, which tries to organise time series by their features instead of by their metadata (Fulcher et al. 2020).

Several scagnostics were previously defined in Graph-Theoretic Scagnostics (L. Wilkinson, Anand, and Grossman 2005), and these are typically considered the basis of the visual features. They were all constructed to range over [0,1], and later scagnostics have maintained this scale. The formulas for these measures were revised in Scagnostics Distributions, and the measures are still calculated according to this paper (Leland Wilkinson and Wills 2008). In addition to the main nine, the benefit of using two additional association scagnostics was discussed in Katrin Grimm’s PhD thesis (Grimm 2016). These two association measures are also used in the tourr projection pursuit (U. Laa and Cook 2020).

There are two existing scagnostics packages: scagnostics (Leland Wilkinson and Wills 2008) and the archived package binostics (Ursula Laa et al. 2020). Both are based on the original C++ code from Scagnostics Distributions (something?), which is difficult to read and difficult to debug. Thus there is a need for a new implementation that enables better diagnosis of the scagnostics, and better graphical tools for examining the results.

This paper describes the R package cassowaryr, which computes the existing scagnostics and adds several new measures. The paper is organised as follows. The next section explains the scagnostics. This is followed by a description of the implementation. Several examples using collections of time series and XXX illustrate the usage.

Scagnostics

Building blocks for the graph-based metrics

In order to capture the visual structure of the data, graph theory is used to calculate most of the scagnostics. The pairwise scatter plot is reconstructed as a graph with the data points as vertices and edges given by the Delaunay triangulation. In the package this calculation is done using the alphahull package (Pateiro-Lopez and Rodriguez-Casal 2019) to construct an object called a scree. This is the basis for all the other objects that are used to calculate the scagnostics (except for monotonic, dcor and splines, which use the raw data). The graph (scree object) is then used to construct the three key structures on which the scagnostics are based: the convex hull, the alpha hull and the minimum spanning tree (MST).

Figure 1: The building blocks for graph-based scagnostics
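To make one of these building blocks concrete, the following is a minimal sketch of a convex hull computation in Python using only the standard library. It is not the package's implementation (which relies on alphahull in R); it uses Andrew's monotone chain algorithm, and all function names are illustrative. The last helper applies the ratio 1 - sqrt(4π·area)/perimeter to an arbitrary polygon; here it is applied to the convex hull purely for illustration, whereas the skinny scagnostic uses the alpha hull.

```python
import math


def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull via Andrew's monotone chain, vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would create a right turn or a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Area of a simple polygon via the shoelace formula."""
    closed = zip(poly, poly[1:] + poly[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in closed)) / 2


def polygon_perimeter(poly):
    """Sum of the edge lengths of a closed polygon."""
    return sum(math.dist(p, q) for p, q in zip(poly, poly[1:] + poly[:1]))


def skinny_ratio(poly):
    """1 - sqrt(4*pi*area)/perimeter: 0 for a circle, larger for thin shapes."""
    return 1 - math.sqrt(4 * math.pi * polygon_area(poly)) / polygon_perimeter(poly)


# Unit square with one interior point; the interior point is not on the hull.
points = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
hull = convex_hull(points)
print(hull, polygon_area(hull), polygon_perimeter(hull), skinny_ratio(hull))
```

The area and perimeter of a hull are the ingredients of the convex and skinny measures defined in the next subsection.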

Graph-based scagnostics

The nine scagnostics defined in Scagnostics Distributions are detailed below with an explanation, formula, and visualisation. We let A = alpha hull, C = convex hull, M = minimum spanning tree, and s = the scagnostic measure. Within M, V is the set of vertices, V^{(k)} is the set of vertices of degree k, e_j is an edge, and q_p is the p-th percentile of the edge lengths. Since some of the measures have some sample size dependence, we let w be a constant that adjusts for that.

\[s_{convex}=w\frac{area(A)}{area(C)}\]

\[s_{skinny}= 1-\frac{\sqrt{4\pi area(A)}}{perimeter(A)}\]

\[s_{outlying}=\frac{length(M_{outliers})}{length(M)}\]

\[s_{stringy} = \frac{|V^{(2)}|}{|V|-|V^{(1)}|}\]

\[s_{skewed} = 1-w(1-\frac{q_{90}-{q_{50}}}{q_{90}-q_{10}})\]

\[s_{sparse}= wq_{90}\]

\[s_{clumpy} = \max_{j}[1-\frac{\max_{k}[length(e_k)]}{length(e_j)}]\]

\[s_{striated} = \frac1{|V|}\sum_{v \in V^{(2)}}I(cos\theta_{e(v,a)e(v,b)}<-0.75)\]

\[s_{monotonic} = r^2_{spearman}\]
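The MST-based formulas above can be illustrated with a small sketch in Python (standard library only). This is not the cassowaryr implementation. It makes several simplifying assumptions: the points are already scaled to the unit square, the MST is built with Prim's algorithm on the complete graph rather than from the Delaunay triangulation (both give a Euclidean minimum spanning tree), the sample size weight w is set to 1, and percentiles use linear interpolation. All function names are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph; returns edges as (i, j, length)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known distance from the tree to each vertex
    best_from = [-1] * n         # tree vertex that achieves that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def quantile(sorted_values, p):
    """Quantile with linear interpolation between order statistics (an assumption)."""
    pos = p * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def mst_scagnostics(points, w=1.0):
    """Stringy, skewed and sparse computed from the MST (w = 1 ignores sample size)."""
    edges = minimum_spanning_tree(points)
    degree = [0] * len(points)
    for i, j, _ in edges:
        degree[i] += 1
        degree[j] += 1
    n_deg1, n_deg2 = degree.count(1), degree.count(2)
    lengths = sorted(length for _, _, length in edges)
    q10, q50, q90 = (quantile(lengths, p) for p in (0.1, 0.5, 0.9))
    # Skewed is undefined when q90 == q10 (all edges equally long); not handled here.
    return {
        "stringy": n_deg2 / (len(points) - n_deg1),
        "skewed": 1 - w * (1 - (q90 - q50) / (q90 - q10)),
        "sparse": w * q90,
    }


# Five collinear points with growing gaps: the MST is a single path.
points = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (0.6, 0.0), (1.0, 0.0)]
result = mst_scagnostics(points)
print(result)
```

For this path-shaped tree every interior vertex has degree two, so stringy takes its maximum value of 1, while skewed and sparse depend only on the distribution of the four edge lengths.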

Association-based scagnostics

The two additional scagnostics discussed by Katrin Grimm are described below.

\[s_{splines}=\max_{i\in x,y}[1-\frac{Var(Residuals_{model~i=.})}{Var(i)}]\]

\[s_{dcor}= \sqrt{\frac{\mathcal{V}(X,Y)}{\sqrt{\mathcal{V}(X,X)\mathcal{V}(Y,Y)}}}\]
where \[\mathcal{V} (X,Y)=\frac{1}{n^2}\sum_{k=1}^n\sum_{l=1}^nA_{kl}B_{kl}\]
where \[A_{kl}=a_{kl}-\bar{a}_{k.}-\bar{a}_{.l}+\bar{a}_{..}\] \[B_{kl}=b_{kl}-\bar{b}_{k.}-\bar{b}_{.l}+\bar{b}_{..}\]
and \[a_{kl}=|X_k-X_l|, \quad b_{kl}=|Y_k-Y_l|\] are the pairwise distances between observations.
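The distance correlation can be sketched in a few lines of Python (standard library only). This is an illustration rather than the package's code, and the function names are hypothetical. It follows the usual sample definition, in which the pairwise distance matrices are double-centred (subtract row and column means, add back the grand mean) and the measure is sqrt(V(X,Y) / sqrt(V(X,X) V(Y,Y))).

```python
import math


def centred_distances(values):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(values[k] - values[l]) for l in range(n)] for k in range(n)]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[k][l] - row_means[k] - row_means[l] + grand_mean for l in range(n)]
        for k in range(n)
    ]


def distance_covariance_sq(a, b):
    """Mean of the elementwise product of two centred distance matrices."""
    n = len(a)
    return sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n ** 2


def dcor(x, y):
    """Sample distance correlation of two equal-length numeric lists."""
    a, b = centred_distances(x), centred_distances(y)
    v_xy = distance_covariance_sq(a, b)
    v_xx = distance_covariance_sq(a, a)
    v_yy = distance_covariance_sq(b, b)
    if v_xx * v_yy == 0:
        return 0.0  # one of the variables is constant
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(v_xy, 0.0) / math.sqrt(v_xx * v_yy))


x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # exact linear relationship
parabola = [v ** 2 for v in x]    # zero Pearson correlation, clear dependence
print(dcor(x, linear), dcor(x, parabola))
```

A linear relationship gives a value of 1, while the parabola, whose Pearson correlation is exactly zero, still receives a value strictly between 0 and 1. This is why dcor is useful for picking up non-linear association.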

Checking the scagnostics calculations

Maybe use Anscombe and datasaurus and the features data here

Software implementation

Installation

Data sets

Functions

Scagnostics functions

Drawing functions

Summary functions

Tests

Examples

Collections of time series

GOAL: Use scagnostics to find difference in shapes between groups. Here we want to first use features to describe a time series, and then secondly choosing pairs of features where there is the biggest difference between groups according to a scagnostic.

A paragraph describing the compenginets data

Analysis notes:

Compare two sets of time series

This analysis compares the features of macroeconomic and microeconomic series, using scagnostics. The goal of the comparison is to compare shapes, not necessarily centres of groups as might be done in LDA or other machine learning methods.

Here, just a small set of features is examined (because the code is fragile), but what emerges as interesting is the difference between curvature and trend strength. Microeconomic series tend to have high values on trend strength, and a range of values on curvature. In comparison, macroeconomic series tend to have near-constant average values on curvature, and highly varied values on trend strength.

Plotting a few series actually suggests that the microeconomic series contain lots of micro structure, which might be what we should expect. Interestingly the trend strength seems to pick up the jaggies!

Parkinsons

Black holes and neutron star mergers?

AFL player statistics

Some explanation of AFL, and about the AFLW competition. Particularly explain any stats used in the plots: goals, kicks, possessions, …

The Australian Football League Women’s (AFLW) is the national semi-professional Australian rules football league for female players. Here we will analyse data sourced from the official AFL website with information on the 2020 season, in which the league had 14 teams and 1932 players. There are 68 variables, 38 of which are numeric. The others are categorical, like the players' names or match ids, which would not be used in scagnostic calculations. These numeric variables are recorded per player per game:

- timeOnGroundPercentage: percentage of the game the player was on the field.
- goals: the 6 points a team gets when they kick the ball between the two big posts.
- behinds: the 1 point a team gets when they kick the ball between the big post and small post.
- kicks: number of kicks by the player in this game.
- handballs: number of handballs by the player in the game.
- disposals: the number of kicks and handballs a player has.
- marks: total number of marks in the game (the ball travels more than 15m and the player catches it without another player touching it or it hitting the ground).
- bounces: the number of times a player bounced the ball in a game. A player must bounce the ball if they travel more than 15m and they can only bounce the ball once.
- tackles: number of tackles performed by the player.
- contestedPossessions: the number of disposals a player has under pressure, i.e. if a player is getting tackled and they get a handball or kick out of the scuffle.
- uncontestedPossessions: the number of disposals a player has under no pressure, where they have space and time to get rid of the ball.
- totalPossessions: the total number of times the player has the ball.
- inside50s: the number of times the player has the ball within the 50m arc around the opponent's goals.
- marksInside50: the number of marks a player gets within the 50m arc around the opponent's goals.
- contestedMarks: the number of marks a player has under pressure.
- hitouts: how many times a player or team taps or punches the ball from a stoppage.
- onePercenters: all the things a player can do without registering a disposal, e.g. spoils (punching the ball to stop someone from marking it), shepherding (blocking for a teammate), and smothering.
- disposalEfficiency: a measure of how well a player disposes of the ball. E.g. if a player kicks or handballs to the opposition a lot, they will have a low disposal efficiency percentage.
- clangers: how many times a player or team disposes of the ball and it results in a turnover to the other team.
- freesFor: this player was awarded a free kick.
- freesAgainst: this player caused a free kick to be awarded to the other team.
- dreamTeamPoints: fantasy football scoring points.
- rebound50s: how many times the player moves the ball out of their defence 50m arc.
- goalAssists: number of times the player gave the pass immediately before the player that scored a goal.
- goalAccuracy: percentage ratio of the number of goals kicked to the number of goal attempts.
- turnovers: this player's disposal caused a turnover (the ball touches the ground and the other team gets it).
- intercepts: number of times this player intercepts the disposal of the other team.
- tacklesInside50: number of tackles performed by this player within their defence 50m arc.
- shotsAtGoal: total number of shots at goal for this player (sum of goals, behinds and misses).
- scoreInvolvements: number of times the player was involved in a passage of play leading up to a goal.
- metresGained: how far a player has been able to advance the ball without turning it over.
- clearances.centreClearances: clearances from the centre bounce after a goal or at the start of a quarter.
- clearances.stoppageClearances: all clearances from stoppages around the ground.
- clearances.totalClearances: how many times a player or team clears the ball from a stoppage or from the centre.

With 38 variables, there are 703 possible scatterplots to make. The scagnostics can suggest which are the interesting ones to examine.
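The count of 703 is the number of unordered pairs of 38 variables, and the screening step amounts to scoring every pair and sorting. The following is a minimal Python sketch (standard library only) under stated assumptions: the toy columns `a`, `b` and `c` are invented for illustration and are not AFLW data, the function names are hypothetical, and the score used is the monotonic scagnostic (squared Spearman correlation) rather than splines.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def monotonic(x, y):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson(average_ranks(x), average_ranks(y)) ** 2


def rank_pairs(columns, score):
    """Score every unordered pair of columns and sort from highest to lowest."""
    scored = [
        ((u, v), score(columns[u], columns[v])) for u, v in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Number of distinct scatterplots for 38 numeric variables.
n_pairs = math.comb(38, 2)

# Toy data (hypothetical): b increases with a, c is unrelated to both.
columns = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 5, 9, 12], "c": [5, 1, 4, 2, 3]}
ranked = rank_pairs(columns, monotonic)
print(n_pairs, ranked)
```

Passing the score as a function keeps the ranking step independent of the measure, so any scagnostic with the same two-vector signature could be substituted.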

| Var1 | Var2 | splines |
|---|---|---|
| totalPossessions | disposals | 0.94 |
| clearances.totalClearances | clearances.stoppageClearances | 0.88 |
| goalAccuracy | goals | 0.83 |
| metresGained | kicks | 0.77 |
| dreamTeamPoints | disposals | 0.74 |
| disposals | kicks | 0.72 |
| dreamTeamPoints | totalPossessions | 0.72 |
| totalPossessions | uncontestedPossessions | 0.68 |
| dreamTeamPoints | kicks | 0.67 |
| uncontestedPossessions | disposals | 0.66 |

Figure 2: Scatterplots with high values on the splines scagnostic. Mouse over the points to examine the players relative to the statistics.

Figure 2 shows three scatterplots that score highly on the splines scagnostic. Each of these shows a relatively strong monotonic relationship between the two variables. In the interactive version of the plot, mousing over reveals some high-performing players, e.g. Anne Hatchard has a lot of possessions, disposals and kicks, and Kaitlyn Ashmore kicked 4 goals in a match with 100% accuracy.

NOTE: Each player is represented multiple times here, I think. The stats are per game. Maybe it is better to aggregate for each player and re-do the statistics?

(#fig:some_are_kickers)Some players tend to kick the ball, even when challenged, whereas others more often use handball for disposals.

The scagnostics need to be used and interpreted with the type of dataset you are working with in mind. For example, since these are sports stats, almost all of the variables are discrete. This means that in the case of the striated variable, we would be interested in the scatter plots that are very low on striated rather than high. Let's see which striated measure can find some interesting scatter plots.

Why the lowest values on striated? This is a discrete data set; if all the points are not at right angles or in a straight line to each other, they are not just randomly spread on the plot. Since the old striated measure is specifically trying to find a continuous variable against a discrete variable, its highest values are also identified by striated_adjusted. The lowest values on striated simply identify a plot where all the points are at right angles, once again a measure of discreteness but one that is not identified by striated. Striated_adjusted encapsulates both versions of discreteness in the values that get exactly a 1; this means the scatter plot that gets the lowest value should be the one where both variables are continuous. Following that, striated_adjusted gives some interesting scatter plots in both goal and disposal accuracy vs number. comment on the scatter plots.

World Development Indicators

The World Bank delivers a lot of development indicators, for many countries and multiple years. This might be a good example to identify pairs of indicators with interesting relationships.

Download data from:

https://databank.worldbank.org/source/world-development-indicators/preview/on#

Summary

References

Anscombe, F. J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1): 17–21. https://doi.org/10.1080/00031305.1973.10478966.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.
Dang, Tuan Nhon, and Leland Wilkinson. 2014. “ScagExplorer: Exploring Scatterplots by Their Scagnostics.” In 2014 IEEE Pacific Visualization Symposium, 73–80. https://doi.org/10.1109/PacificVis.2014.42.
Diaconis, Persi, and David Freedman. 1984. “Asymptotics of Graphical Projection Pursuit.” The Annals of Statistics 12 (3): 793–815. http://www.jstor.org/stable/2240961.
Fulcher, Ben D, Carl H Lubba, Sarab S Sethi, and Nick S Jones. 2020. “A Self-Organizing, Living Library of Time-Series Data.” Scientific Data 7 (1): 213–13.
Grimm, Katrin. 2016. “Kennzahlenbasierte Grafikauswahl.” Doctoral thesis, Universität Augsburg.
Laa, U., and D. Cook. 2020. “Using Tours to Visually Investigate Properties of New Projection Pursuit Indexes with Application to Problems in Physics.” Computational Statistics 35: 1171–1205. https://doi.org/10.1007/s00180-020-00954-8.
Laa, U., D. Cook, and S. Lee. 2020. “Burning Sage: Reversing the Curse of Dimensionality in the Visualization of High-Dimensional Data.” arXiv: Computation.
Laa, Ursula, Hadley Wickham, Dianne Cook, and Heike Hofmann. 2020. “Binostics: Computing Scagnostics Measures in R and C++.” https://github.com/uschiLaa/paper-binostics.
Lee, Stuart, Ursula Laa, and Dianne Cook. 2020. “Casting Multiple Shadows: High-Dimensional Interactive Data Visualisation with Tours and Embeddings.” http://arxiv.org/abs/2012.06077.
Locke, Steph, and Lucy D’Agostino McGowan. 2018. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.
Pateiro-Lopez, Beatriz, and Alberto Rodriguez-Casal. 2019. alphahull: Generalization of the Convex Hull of a Sample of Points in the Plane. https://CRAN.R-project.org/package=alphahull.
Tukey, John. 1988. “The Collected Works of John W. Tukey.” In, edited by William S. Cleveland, 411, 427, 433. Chapman; Hall/CRC.
Wilkinson, L., A. Anand, and R. Grossman. 2005. “Graph-Theoretic Scagnostics.” In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005., 157–64.
Wilkinson, Leland, and Graham Wills. 2008. “Scagnostics Distributions.” Journal of Computational and Graphical Statistics 17 (2): 473–91.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".